On the Accuracy and Completeness of the Record Matching Process

Authors

  • Vassilios S. Verykios
  • Mohamed G. Elfeky
  • Ahmed K. Elmagarmid
  • Munir Cochinwala
  • Siddhartha R. Dalal
Abstract

The role of data resources in today's business environment is multi-faceted. Primarily, they support the operational needs of an organization or a company. Secondarily, they can be used for decision support and management. The quality of the data used to support operational needs is usually below the quality required for decision support and management. Recent advances in information systems investigate ways to estimate, improve and preserve the quality of data, so as to increase their value in both domains. Record matching, or linking, is one of the phases of the data quality improvement process, in which records from different sources are cleansed and integrated in a centralized data store to be used for various purposes. Both earlier and recent studies in data quality and record linkage focus on various statistical models, which make strong assumptions about the probabilities of attribute errors. In this study, we evaluate different models for record linkage that are built based on data only. We use a program that generates data with known error distributions, and we also train classification models, which we use to estimate the accuracy and the completeness of the record linking process. The results indicate that the automated learning techniques are adequate for this process and that both their accuracy and their completeness are comparable to those of other, mostly manual, processes.

1. Introduction

Enterprise data is created, used and shared by a corporation in conducting business. It is a critical business asset that must be designed, analyzed, and managed with data quality as a guiding principle and not as an afterthought [17]. Poor data quality, which results from missing customer information, wrong address information, etc., undermines customer satisfaction, leads to high and unnecessary costs and, more importantly, impacts decision making [21, 20, 16].
A number of reasons are responsible for bad data quality, including: (a) multiple sources of data, (b) incompatible data, (c) data from multiple levels of granularity, (d) redundant data, (e) corrupted and noisy data, and (f) missing attribute values. Data quality is achieved in three stages: the first is based on the cleansing, or scrubbing, of data; the second on the matching, or linking, of the records; and the third on the consolidation, or integration, of the cleansed information with other internal or external sources. In the first stage, the data is parsed, corrected, standardized and enhanced …
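
The abstract describes generating data with known error distributions and training classification models on it to estimate the accuracy and completeness of record linking. A minimal Python sketch of that general idea follows; it is an illustration under stated assumptions, not the authors' program or models. The function names, the character-substitution error model, the single-threshold classifier, and the working definitions (accuracy as the fraction of pairs classified correctly, completeness as the fraction of true matches that are found) are all assumptions introduced here.

```python
import random
from difflib import SequenceMatcher

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def corrupt(text, error_rate, rng):
    """Replace each character with a random letter with probability error_rate
    (an assumed, very simple error model)."""
    return "".join(
        rng.choice(LETTERS) if rng.random() < error_rate else ch for ch in text
    )

def similarity(a, b):
    """String similarity in [0, 1]; exactly 1.0 only for identical strings."""
    return SequenceMatcher(None, a, b).ratio()

def make_pairs(names, error_rate, rng):
    """Build labelled record pairs: a name and a corrupted copy of itself (match),
    and a name and a corrupted copy of a different name (non-match)."""
    pairs = []
    for i, name in enumerate(names):
        pairs.append((name, corrupt(name, error_rate, rng), True))
        other = names[(i + 1) % len(names)]
        pairs.append((name, corrupt(other, error_rate, rng), False))
    return pairs

def learn_threshold(pairs):
    """'Train' a one-parameter classifier: pick the similarity cutoff that
    classifies the largest number of training pairs correctly."""
    scored = [(similarity(a, b), label) for a, b, label in pairs]
    candidates = sorted({score for score, _ in scored})

    def n_correct(threshold):
        return sum((score >= threshold) == label for score, label in scored)

    return max(candidates, key=n_correct)

def evaluate(pairs, threshold):
    """Return (accuracy, completeness) under the assumed definitions above."""
    correct = found = true_matches = 0
    for a, b, label in pairs:
        predicted = similarity(a, b) >= threshold
        correct += predicted == label
        true_matches += label
        found += predicted and label
    accuracy = correct / len(pairs)
    completeness = found / true_matches if true_matches else 0.0
    return accuracy, completeness

if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical sample values, not data from the study.
    names = ["smith", "johnson", "williams", "martinez", "anderson", "thompson"]
    train = make_pairs(names, error_rate=0.15, rng=rng)
    test = make_pairs(names, error_rate=0.15, rng=rng)
    cutoff = learn_threshold(train)
    accuracy, completeness = evaluate(test, cutoff)
    print(f"cutoff={cutoff:.2f} accuracy={accuracy:.2f} completeness={completeness:.2f}")
```

Because the error rate is a parameter of the generator, the same script can be rerun at different error rates to see how the two measures change as the data gets noisier. A real study would use richer comparison vectors over several attributes and a proper learning algorithm in place of the single cutoff.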


Similar resources

Analytical Comparison of Methods for Calculating the Completeness of VGI

Spatial data, which is one of the main needs of human societies from business organizations to the general users today, cannot meet the needs of a wide range of users without changing the structure of conventional methods of data registration and updating on a metropolitan scale. Open Street Map, as one of the most successful implementations of the crowdsourcing approach to spatial data with th...


Retracted: Completion Status of Surgery-Specific Record Sheets in the University Hospitals of Urmia (Year 1382)

Introduction: Information documented in the medical records includes demographic factors, history, examinations and treatment. Information documented in the health record file has a considerable effect on the quality of care and treatment of patients. The information concerning patients who need an operation is recorded in a special operation sheet that should be exact, timely and complete....


Validation of Volunteered Geographic Information Landuse Change Using Satellite Imagery

Land use change monitoring is one of the main concerns of managers and urban planners due to human activities and unbalanced physical development in urban areas. In this paper, a combination of remote sensing data and volunteered geographic information was used to assess the quality of volunteered geographic information for monitoring land use and land cover changes. For this purpose, the ORBVIE...


Assessment of the completeness of Volunteered Geographic Information focusing on building blocks data (Case Study: Tehran metropolis)

Open Street Map (OSM) is currently the largest collection of volunteered geographic data, widely used in many projects as an alternative to or integrated with authoritative data. However, the quality of these data has been one of the obstacles to their wide use. In this article, from among the elements related to the quality of volunteered geographic data, we have tried to examine the com...


Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...




Publication date: 2000